In this section I will demonstrate how to retrieve the dataset from the internet, load it into the Jupyter environment, and wrangle the data into a tidy form ready for analysis.
# Begin by importing the necessary modules
import pandas as pd
import numpy as np
import requests
import zipfile
from io import BytesIO
In this section, I download a zip file hosted on GitHub that contains the dataset, extract its files into the current directory, and then load the .csv files into the environment.
# Define the url to the .zip file
url = 'https://github.com/chadwickbureau/baseballdatabank/archive/master.zip'
# Download the file with the requests library
req = requests.get(url)
# Fail early if the download did not succeed
req.raise_for_status()
# Extract the dataset from the downloaded zip file.
# Use a distinct variable name so the zipfile module is not shadowed.
zf = zipfile.ZipFile(BytesIO(req.content))
zf.extractall('./')
# Loading the .csv files necessary for analysis into pandas dataframes
df_batting = pd.read_csv("baseballdatabank-master/core/Batting.csv")
df_fielding = pd.read_csv("baseballdatabank-master/core/Fielding.csv")
df_pitching = pd.read_csv("baseballdatabank-master/core/Pitching.csv")
df_hof = pd.read_csv("baseballdatabank-master/core/HallOfFame.csv")
Now that the data has been downloaded and is housed in pandas dataframes, it is time to wrangle it to fit our needs. For the purpose of formalizing Hall of Fame criteria, the goal is to compose a table that houses all relevant career stats for each player in history. The dataset separates different types of stats into different tables: batting, fielding, and pitching each have their own. We will combine these so that each player's career is represented by one row in a single table.
I have chosen to separate players into two distinct groups: pitchers and position players. Historically, exceptionally few players have ever pitched and also played a non-pitcher position, and fewer still have done both with any consistency. It makes the most sense to separate these groups and treat them differently; pundits do the same, evaluating each by entirely different criteria.
We also want to add an "HoF" column for each player: a boolean that is True if the player is a Hall of Famer and False otherwise. This label is necessary for the machine learning section, where a classifier is trained on labeled data.
This is the initial state of the dataframe as given by the .csv file.
df_batting.head()
| | playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | abercda01 | 1871 | 1 | TRO | NaN | 1 | 4 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | 0.0 |
| 1 | addybo01 | 1871 | 1 | RC1 | NaN | 25 | 118 | 30 | 32 | 6 | ... | 13.0 | 8.0 | 1.0 | 4 | 0.0 | NaN | NaN | NaN | NaN | 0.0 |
| 2 | allisar01 | 1871 | 1 | CL1 | NaN | 29 | 137 | 28 | 40 | 4 | ... | 19.0 | 3.0 | 1.0 | 2 | 5.0 | NaN | NaN | NaN | NaN | 1.0 |
| 3 | allisdo01 | 1871 | 1 | WS3 | NaN | 27 | 133 | 28 | 44 | 10 | ... | 27.0 | 1.0 | 1.0 | 0 | 2.0 | NaN | NaN | NaN | NaN | 0.0 |
| 4 | ansonca01 | 1871 | 1 | RC1 | NaN | 25 | 120 | 29 | 39 | 11 | ... | 16.0 | 6.0 | 2.0 | 2 | 1.0 | NaN | NaN | NaN | NaN | 0.0 |
5 rows × 22 columns
In the initial state shown above, each row represents one player's batting stats for one year. I wish to condense this table so that each row represents one player's career statistics. To do this, I take every row for a given playerID and sum the stats into one cumulative row for the career. This is valid because the remaining statistics are counting stats, which sum meaningfully.
# Removing columns that are unnecessary for my analysis to limit clutter.
df_batting_dropped = df_batting.drop(columns = ["yearID", "stint", "teamID", "lgID"])
# This line groups by playerID and then sums the columns of each group. The effect of this is taking
# all rows with equal playerIDs and combining them into one, cumulative row for the player's career.
df_batting_career = df_batting_dropped.groupby(["playerID"], as_index=False).sum()
# Later in the fielding section, there are duplicate column names for "G", "SB", and "CS" but
# they mean different things. I rename them to "G_b" for G_batting, etc. for clarity.
df_batting_career = df_batting_career.rename(columns={"G": "G_b", "SB": "SB_b", "CS": "CS_b"})
# Here is the result of this stage, we now have career batting stats in each row!
df_batting_career.head()
| | playerID | G_b | AB | R | H | 2B | 3B | HR | RBI | SB_b | CS_b | BB | SO | IBB | HBP | SH | SF | GIDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | 331 | 4 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | aaronha01 | 3298 | 12364 | 2174 | 3771 | 624 | 98 | 755 | 2297.0 | 240.0 | 73.0 | 1402 | 1383.0 | 293.0 | 32.0 | 21.0 | 121.0 | 328.0 |
| 2 | aaronto01 | 437 | 944 | 102 | 216 | 42 | 6 | 13 | 94.0 | 9.0 | 8.0 | 86 | 145.0 | 3.0 | 0.0 | 9.0 | 6.0 | 36.0 |
| 3 | aasedo01 | 448 | 5 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | abadan01 | 15 | 21 | 1 | 2 | 0 | 0 | 0 | 0.0 | 0.0 | 1.0 | 4 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
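The groupby-sum pattern used above can be verified on a small toy frame (players "p1" and "p2" are hypothetical, made up for illustration):

```python
import pandas as pd

# Two seasons for "p1" and one for "p2" (hypothetical counting stats)
toy = pd.DataFrame({
    "playerID": ["p1", "p1", "p2"],
    "HR": [10, 15, 3],
    "H": [100, 120, 40],
})

# Collapse yearly rows into one cumulative career row per player
career = toy.groupby("playerID", as_index=False).sum()
# "p1" ends up with HR = 25 and H = 220 in a single row
```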
In this section I perform the same process as above, but on the fielding data. The structure of the data and the wrangling method are very similar.
# Initial state
df_fielding.head()
| | playerID | yearID | stint | teamID | lgID | POS | G | GS | InnOuts | PO | A | E | DP | PB | WP | SB | CS | ZR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | abercda01 | 1871 | 1 | TRO | NaN | SS | 1 | 1.0 | 24.0 | 1 | 3 | 2.0 | 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | addybo01 | 1871 | 1 | RC1 | NaN | 2B | 22 | 22.0 | 606.0 | 67 | 72 | 42.0 | 5 | NaN | NaN | NaN | NaN | NaN |
| 2 | addybo01 | 1871 | 1 | RC1 | NaN | SS | 3 | 3.0 | 96.0 | 8 | 14 | 7.0 | 0 | NaN | NaN | NaN | NaN | NaN |
| 3 | allisar01 | 1871 | 1 | CL1 | NaN | 2B | 2 | 0.0 | 18.0 | 1 | 4 | 0.0 | 0 | NaN | NaN | NaN | NaN | NaN |
| 4 | allisar01 | 1871 | 1 | CL1 | NaN | OF | 29 | 29.0 | 729.0 | 51 | 3 | 7.0 | 1 | NaN | NaN | NaN | NaN | NaN |
A unique challenge in the fielding data is the new column "POS", which describes the fielder's position. This is problematic because qualitative position values cannot be meaningfully "summed" over a career. I handle this by assigning each player a single career position: the one they played most, measured by games played. This is likely how players are remembered and defined.
# Dropping unnecessary columns
df_fielding_dropped = df_fielding.drop(columns = ["yearID", "stint", "teamID", "lgID"])
# This groups by playerID as seen before, but now we also group by position so that we obtain
# career stats for each player and each position. I then sort these values by playerID, then
# Games. The effect of this is having the first row of each player being their most played position.
df_fielding_career_by_pos = df_fielding_dropped.groupby(["playerID", "POS"], as_index=False).sum() \
.sort_values(['playerID', 'G'], ascending=[True, False])
# The resulting table.
df_fielding_career_by_pos.head()
| | playerID | POS | G | GS | InnOuts | PO | A | E | DP | PB | WP | SB | CS | ZR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | P | 331 | 0.0 | 1011.0 | 11 | 29 | 3.0 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | aaronha01 | OF | 2760 | 2735.0 | 72074.0 | 5539 | 201 | 117.0 | 41 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | aaronha01 | 1B | 210 | 203.0 | 5274.0 | 1791 | 117 | 20.0 | 152 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | aaronha01 | 2B | 43 | 34.0 | 913.0 | 102 | 99 | 7.0 | 25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | aaronha01 | 3B | 7 | 5.0 | 153.0 | 4 | 12 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
The purpose of this following section is to select each player's most played defensive position. I do another groupby on playerID, but this time, instead of summing every column, I sum all columns except "POS". For position, I define the aggregation function to be "first". This selects the first position value seen in each group, which, thanks to the sorting above, is the player's most played position.
# Define a dictionary of aggregation functions
# This maps from column names to aggregation function. It defines all columns to be summed
f = dict.fromkeys(df_fielding_career_by_pos, 'sum')
# Explicitly remove playerID because this is the variable we are grouping on
f.pop("playerID")
# Explicitly define POS to be aggregated with the "first" function
f["POS"] = "first"
# Carry out grouping and aggregation using f
df_fielding_career = df_fielding_career_by_pos.groupby(["playerID"], as_index=False).agg(f)
# Rename columns as described in the batting section
df_fielding_career = df_fielding_career.rename(columns={"G": "G_f", "SB": "SB_f", "CS": "CS_f"})
# The resulting career stats table!
df_fielding_career.head()
| | playerID | POS | G_f | GS | InnOuts | PO | A | E | DP | PB | WP | SB_f | CS_f | ZR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | P | 331 | 0.0 | 1011.0 | 11 | 29 | 3.0 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | aaronha01 | OF | 3020 | 2977.0 | 78414.0 | 7436 | 429 | 144.0 | 218 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | aaronto01 | 1B | 387 | 206.0 | 6472.0 | 1317 | 113 | 22.0 | 124 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | aasedo01 | P | 448 | 91.0 | 3328.0 | 67 | 135 | 13.0 | 10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | abadan01 | 1B | 9 | 4.0 | 138.0 | 37 | 1 | 1.0 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
In this section we will merge the tables for career batting and fielding stats to create one all-encompassing table for a player's career. I choose to merge with an outer join in order to preserve any players that may have only batting or only fielding stats. We would not want these players to be removed from our table.
# Merge the two career tables with an outer join on playerID
df_batting_and_fielding_career = df_batting_career.merge(df_fielding_career, how='outer', on='playerID')
df_batting_and_fielding_career.head()
| | playerID | G_b | AB | R | H | 2B | 3B | HR | RBI | SB_b | ... | InnOuts | PO | A | E | DP | PB | WP | SB_f | CS_f | ZR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | 331 | 4 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | ... | 1011.0 | 11.0 | 29.0 | 3.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | aaronha01 | 3298 | 12364 | 2174 | 3771 | 624 | 98 | 755 | 2297.0 | 240.0 | ... | 78414.0 | 7436.0 | 429.0 | 144.0 | 218.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | aaronto01 | 437 | 944 | 102 | 216 | 42 | 6 | 13 | 94.0 | 9.0 | ... | 6472.0 | 1317.0 | 113.0 | 22.0 | 124.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | aasedo01 | 448 | 5 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | ... | 3328.0 | 67.0 | 135.0 | 13.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | abadan01 | 15 | 21 | 1 | 2 | 0 | 0 | 0 | 0.0 | 0.0 | ... | 138.0 | 37.0 | 1.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 31 columns
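One consequence of the outer join worth noting: a player appearing in only one of the two tables gets NaN in the other table's columns, which may need to be filled (e.g. with 0) before modeling. A minimal sketch with hypothetical playerIDs:

```python
import pandas as pd

bat = pd.DataFrame({"playerID": ["a", "b"], "HR": [10, 5]})
fld = pd.DataFrame({"playerID": ["b", "c"], "PO": [30, 12]})

# Outer join keeps "a" (batting only) and "c" (fielding only)
merged = bat.merge(fld, how="outer", on="playerID")

# The missing stats show up as NaN and can be zero-filled
filled = merged.fillna(0)
```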
As the last step of the wrangling process, we must record whether each player has been inducted into the MLB Hall of Fame, for use in the machine learning and classification sections. I do this by creating a dictionary that maps every MLB player in history to a boolean: True if they are a Hall of Famer, False if not.
# Initialize dict
is_hof = dict()
# Set all values to False initially, to be revised below
for player in df_batting_and_fielding_career['playerID']:
    is_hof[player] = False
# Loop through the Hall of Fame table from our dataset to obtain info on HoF Induction status.
for _, row in df_hof.iterrows():
    # If the player has been inducted AND the player is in our dictionary, set their value to True.
    # The membership check is necessary because not all Hall of Fame inductees are players
    # that we care about for our analysis. Some inductees are managers, executives, or
    # other non-player roles in baseball.
    if row['inducted'] == 'Y' and row['playerID'] in is_hof:
        is_hof[row['playerID']] = True
# Create a 'HoF' column set to the values of our dictionary. The rows align because the
# dictionary was initialized in the order of 'playerID' in the career stats table, and
# Python dicts (3.7+) preserve insertion order.
df_batting_and_fielding_career['HoF'] = is_hof.values()
# Our resulting table with HoF values!
df_batting_and_fielding_career.head()
| | playerID | G_b | AB | R | H | 2B | 3B | HR | RBI | SB_b | ... | PO | A | E | DP | PB | WP | SB_f | CS_f | ZR | HoF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | 331 | 4 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | ... | 11.0 | 29.0 | 3.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False |
| 1 | aaronha01 | 3298 | 12364 | 2174 | 3771 | 624 | 98 | 755 | 2297.0 | 240.0 | ... | 7436.0 | 429.0 | 144.0 | 218.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | True |
| 2 | aaronto01 | 437 | 944 | 102 | 216 | 42 | 6 | 13 | 94.0 | 9.0 | ... | 1317.0 | 113.0 | 22.0 | 124.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False |
| 3 | aasedo01 | 448 | 5 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | ... | 67.0 | 135.0 | 13.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False |
| 4 | abadan01 | 15 | 21 | 1 | 2 | 0 | 0 | 0 | 0.0 | 0.0 | ... | 37.0 | 1.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False |
5 rows × 32 columns
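The dict-alignment approach above works because modern Python dicts preserve insertion order; an alternative that does not depend on row order at all is `Series.map`. A sketch with hypothetical playerIDs:

```python
import pandas as pd

career = pd.DataFrame({"playerID": ["a", "b", "c"]})
inducted = {"b"}  # hypothetical set of inducted playerIDs

# Map each playerID directly to its membership in the inducted set
career["HoF"] = career["playerID"].map(lambda p: p in inducted)
```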
We have now created the table for position players! Now it is time to create the table for pitchers. This involves similar processes but is notably easier, because all pitching stats are found in a single table rather than being divided between batting and fielding tables as for position players.
We begin by viewing the initial state of the table.
df_pitching.head()
| | playerID | yearID | stint | teamID | lgID | W | L | G | GS | CG | ... | IBB | WP | HBP | BK | BFP | GF | R | SH | SF | GIDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bechtge01 | 1871 | 1 | PH1 | NaN | 1 | 2 | 3 | 3 | 2 | ... | NaN | 7 | NaN | 0 | 146.0 | 0 | 42 | NaN | NaN | NaN |
| 1 | brainas01 | 1871 | 1 | WS3 | NaN | 12 | 15 | 30 | 30 | 30 | ... | NaN | 7 | NaN | 0 | 1291.0 | 0 | 292 | NaN | NaN | NaN |
| 2 | fergubo01 | 1871 | 1 | NY2 | NaN | 0 | 0 | 1 | 0 | 0 | ... | NaN | 2 | NaN | 0 | 14.0 | 0 | 9 | NaN | NaN | NaN |
| 3 | fishech01 | 1871 | 1 | RC1 | NaN | 4 | 16 | 24 | 24 | 22 | ... | NaN | 20 | NaN | 0 | 1080.0 | 1 | 257 | NaN | NaN | NaN |
| 4 | fleetfr01 | 1871 | 1 | NY2 | NaN | 0 | 1 | 1 | 1 | 1 | ... | NaN | 0 | NaN | 0 | 57.0 | 0 | 21 | NaN | NaN | NaN |
5 rows × 30 columns
I begin by removing unnecessary columns and summing career statistics for each player as we did for position players.
# Removing columns that are unnecessary for my analysis to limit clutter.
df_pitching_dropped = df_pitching.drop(columns = ["yearID", "stint", "teamID", "lgID"])
df_pitching_career = df_pitching_dropped.groupby(["playerID"], as_index=False).sum()
df_pitching_career.head()
| | playerID | W | L | G | GS | CG | SHO | SV | IPouts | H | ... | IBB | WP | HBP | BK | BFP | GF | R | SH | SF | GIDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | 16 | 18 | 331 | 0 | 0 | 0 | 69 | 1011 | 296 | ... | 22.0 | 12 | 16.0 | 1 | 1475.0 | 141 | 169 | 17.0 | 11.0 | 21.0 |
| 1 | aasedo01 | 66 | 60 | 448 | 91 | 22 | 5 | 82 | 3328 | 1085 | ... | 45.0 | 22 | 7.0 | 3 | 4730.0 | 235 | 503 | 50.0 | 34.0 | 106.0 |
| 2 | abadfe01 | 8 | 29 | 384 | 6 | 0 | 0 | 2 | 992 | 309 | ... | 10.0 | 10 | 12.0 | 2 | 1399.0 | 97 | 143 | 7.0 | 12.0 | 25.0 |
| 3 | abbeybe01 | 22 | 40 | 79 | 65 | 52 | 0 | 1 | 1704 | 686 | ... | 0.0 | 18 | 26.0 | 0 | 2568.0 | 14 | 442 | 0.0 | 0.0 | 0.0 |
| 4 | abbeych01 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | 6 | ... | 0.0 | 1 | 0.0 | 0 | 12.0 | 1 | 3 | 0.0 | 0.0 | 0.0 |
5 rows × 26 columns
We now have career stats for all the pitchers! All that's left to do is to add a HoF column in the same way we did before.
# Initialize dict
is_hof = dict()
for player in df_pitching_career['playerID']:
    is_hof[player] = False
for _, row in df_hof.iterrows():
    if row['inducted'] == 'Y' and row['playerID'] in is_hof:
        is_hof[row['playerID']] = True
df_pitching_career['HoF'] = is_hof.values()
# Our resulting table with HoF values!
df_pitching_career.head()
| | playerID | W | L | G | GS | CG | SHO | SV | IPouts | H | ... | WP | HBP | BK | BFP | GF | R | SH | SF | GIDP | HoF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | 16 | 18 | 331 | 0 | 0 | 0 | 69 | 1011 | 296 | ... | 12 | 16.0 | 1 | 1475.0 | 141 | 169 | 17.0 | 11.0 | 21.0 | False |
| 1 | aasedo01 | 66 | 60 | 448 | 91 | 22 | 5 | 82 | 3328 | 1085 | ... | 22 | 7.0 | 3 | 4730.0 | 235 | 503 | 50.0 | 34.0 | 106.0 | False |
| 2 | abadfe01 | 8 | 29 | 384 | 6 | 0 | 0 | 2 | 992 | 309 | ... | 10 | 12.0 | 2 | 1399.0 | 97 | 143 | 7.0 | 12.0 | 25.0 | False |
| 3 | abbeybe01 | 22 | 40 | 79 | 65 | 52 | 0 | 1 | 1704 | 686 | ... | 18 | 26.0 | 0 | 2568.0 | 14 | 442 | 0.0 | 0.0 | 0.0 | False |
| 4 | abbeych01 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | 6 | ... | 1 | 0.0 | 0 | 12.0 | 1 | 3 | 0.0 | 0.0 | 0.0 | False |
5 rows × 27 columns
In this section we will do some initial exploratory analysis of the cleaned data to investigate suspected relationships. We will use some libraries to assist with regression as well as data visualization; well-composed graphs reveal relationships far more easily than very large tables.
Let's begin our investigation into what may cause players to be considered Hall of Fame quality. Players who are consistently good are much more likely to be inducted than those who are exceptional for only a short period. For this reason I believe longevity in the league is a very important factor. Let's begin by exploring how games played (position players) and innings pitched (pitchers) affect Hall of Fame induction.
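Innings pitched is not stored as a column in the pitching table, but it can be derived from IPouts (total outs recorded) at three outs per inning. A sketch on hypothetical career rows:

```python
import pandas as pd

# Hypothetical career pitching rows; IPouts counts total outs recorded
pitch = pd.DataFrame({"playerID": ["x", "y"], "IPouts": [1011, 600]})

# Three outs per inning
pitch["IP"] = pitch["IPouts"] / 3
```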
import plotly.express as px
x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 5, 4]
# Pass x and y as keyword arguments: the first positional parameter of
# px.scatter is data_frame, not x.
px.scatter(x=x, y=y)